Abstract. A linear graph is a graph whose vertices are totally ordered. 
Biological and linguistic sequences with interactions among symbols are 
naturally represented as linear graphs. Examples include protein con- 
tact maps, RNA secondary structures and predicate-argument struc- 
tures. Our algorithm, linear graph miner (LGM), leverages the vertex 
order for efficient enumeration of frequent subgraphs. Based on the re- 
verse search principle, the pattern space is systematically traversed with- 
out expensive duplication checking. Disconnected subgraph patterns are 
particularly important in linear graphs due to their sequential nature. 
Unlike conventional graph mining algorithms detecting connected pat- 
terns only, LGM can detect disconnected patterns as well. The utility and 
efficiency of LGM are demonstrated in experiments on protein contact 
maps. 



1 Introduction 

Frequent subgraph mining is an active research area with successful applications 
in, e.g., chemoinformatics [13], software science g], and computer vision [T3] . 
The task is to enumerate the complete set of frequently appearing subgraphs in 
a graph database. Early algorithms include AGM [8], FSG [9] and gSpan [19] . 
Since then, researchers have put considerable effort into improving efficiency, for
example, by mining closed patterns only [20], or by early pruning that sacrifices
completeness (e.g., leap search [18]). However, graph mining algorithms are
still too slow for large graph databases (see, e.g., [IT]). The scalability of graph
mining algorithms is much worse than that of algorithms for more restricted classes
such as trees [1] and sequences [14]. This is because, for trees and sequences,
it is possible to design a pattern extension rule that does not create duplicate 
patterns (e.g., rightmost extension) pQ. For general graphs, there are multiple 
ways to generate the same subgraph pattern, and it is necessary to detect du- 
plicate patterns and prune the search tree whenever duplication is detected. In 
gSpan [19], a graph pattern is represented as a DFS code, and the duplication 
check is implemented via minimality checking of the code. It is a very clever
mechanism, because one does not need to track back the patterns generated so
far. Nevertheless, the complexity of duplication checking is exponential in the
pattern size [19]. This harms efficiency substantially, especially when mining large
patterns.

Fig. 1. An example of linear graph

A linear graph is a graph whose vertices are totally ordered [3,5] (Figure 1).
For example, protein contact maps, RNA secondary structures, alternative splicing
patterns in molecular biology and predicate-argument structures [11] in natural
languages can be represented as linear graphs. Amino acid residues of a
protein have a natural ordering from the N- to the C-terminus, and English words
in a sentence are ordered as well. Davydov and Batzoglou [3] addressed the problem
of aligning several linear graphs for RNA sequences, assessed the computational
complexity, and proposed an approximate algorithm. Fertin et al. assessed the
complexity of finding a maximum common pattern in a set of linear graphs [5].
In this paper, we develop a novel algorithm, linear graph miner (LGM), for enu- 
merating frequently appearing subgraphs in a large number of linear graphs. The 
advantage of employing linear graphs is that we can derive a pattern extension 
rule that does not cause duplication, which makes LGM much more efficient 
than conventional graph mining algorithms. 

We design the extension rule based on the reverse search principle [2]. Per- 
haps confusingly, 'reverse search' does not refer to a particular search method, 
but a guideline for designing enumeration algorithms. A pattern extension rule 
specifies how to generate children from a parent in the search space. In reverse 
search, one specifies a rule that generates a parent uniquely from a child (i.e., 
reduction map). The pattern extension rule is obtained by 'reversing' the re- 
duction map: When generating children from a parent, all possible candidates 
are prepared and those mapping back to the parent by the reduction map are 
selected. An advantage of reverse search is that, given a reduction map, the 
completeness of the resulting pattern extension rule can easily be proved [2]. In
data mining, LCM, one of the fastest closed itemset miners, was designed using
reverse search [16]. Reverse search has also been applied recently in the design
of a dense module enumeration algorithm [6] and a geometric graph mining
algorithm [12]. In computational geometry and related fields, there are many
successful applications (see footnote 4).
LGM's reduction map is very simple: remove the largest edge in terms of edge 
ordering. Fortunately, it is not necessary to take the "candidate preparation and 
selection" approach in LGM. We can directly reverse the reduction map to an 
explicit extension rule here. 



4 See a list of applications at http://cgm.cs.mcgill.ca/~avis/doc/rs/applications/index.html



Linear graphs can be perceived as the fusion of graphs and sequences. Sequence
mining algorithms such as PrefixSpan [14] can usually detect gapped sequence
patterns. In applications like motif discovery in protein contact maps [TJ,
it is essential to allow "gaps" in linear graph patterns. More precisely, discon- 
nected graph patterns should be allowed for such applications. Since conventional 
graph mining algorithms can detect only connected graph patterns, their appli- 
cation to contact maps is difficult. In this paper, we aim to detect connected and 
disconnected patterns with a unified framework. 

In experiments, we used a protein 3D-structure dataset from molecular biology.
We compared LGM with gSpan in efficiency, and found that LGM is more
efficient than gSpan. This is surprising, because LGM detects a much larger
number of patterns, including disconnected ones. To compare the two methods
on the same basis, we added supplementary edges to help gSpan detect
some of the disconnected patterns. Then, the efficiency difference became even
more significant.

2 Preliminaries 

Let us first define linear graphs and associated concepts. 

Definition 1 (Linear graph) Denote by $\Sigma_V$ and $\Sigma_E$ the sets of vertex and
edge labels, respectively. A labeled and undirected linear graph $g = (V, E, L_V, L_E)$
consists of an ordered vertex set $V \subset \mathbb{N}$, an edge set $E \subseteq V \times V$, a vertex labeling
$L_V : V \to \Sigma_V$ and an edge labeling $L_E : E \to \Sigma_E$. Let the size of the linear
graph $|g|$ be the number of its edges. Let $\mathcal{G}$ denote the set of all possible linear
graphs and let $\theta \in \mathcal{G}$ denote the empty graph.

The difference from ordinary graphs is that the vertices are defined as a subset 
of natural numbers, introducing the total order. Notice that we do not impose 
connectedness here. The order of edges is defined as follows: 

Definition 2 (Total order among edges) $\forall e_1 = (i, j), e_2 = (k, l) \in E_g$, $e_1 <_e e_2$
if and only if i) $i < k$ or ii) $i = k$, $j < l$.

Namely, one first compares the indices of the left nodes. If they are identical, the 
right nodes are compared. The subgraph relationship between two linear graphs 
is defined as follows. 

Definition 3 (Subgraph) Given two linear graphs $g_1 = (V_1, E_1, L_{V_1}, L_{E_1})$,
$g_2 = (V_2, E_2, L_{V_2}, L_{E_2})$, $g_1$ is a subgraph of $g_2$, $g_1 \subseteq g_2$, if and only if there
exists an injective mapping $m : V_1 \to V_2$ such that

1. $\forall i \in V_1 : L_{V_1}(i) = L_{V_2}(m(i))$, vertex labels are identical,

2. $\forall (i, j) \in E_1 : (m(i), m(j)) \in E_2, L_{E_1}(i, j) = L_{E_2}(m(i), m(j))$, all edges of $g_1$
exist in $g_2$, and

3. $\forall (i, j) \in E_1 : i < j \to m(i) < m(j)$, the order of vertices is conserved.




Fig. 2. (Left) Graph-shaped search space. (Right) Search tree induced by the reduction
map

Fig. 3. Example of children patterns. There are three types of extension with respect
to the number of nodes: (A) no-node-addition, (B) one-node-addition, (C) two-nodes-addition.



The difference from the ordinary subgraph relation is that the vertex order is 
conserved. Finally, frequent subgraph mining is defined as follows. 

Definition 4 (Frequent linear subgraph mining) For a set of linear graphs
$G = \{g_1, \ldots, g_{|G|}\}$, $g_i \in \mathcal{G}$, a minimum support threshold $\sigma > 0$ and a maximum
pattern size $s > 0$, find all $g \in \mathcal{G}$ such that g is frequent enough in G, i.e.,

$|\{i = 1, \ldots, |G| : g \subseteq g_i\}| \ge \sigma, \quad |g| \le s.$
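Definitions 3 and 4 can be illustrated with a brute-force sketch. It is not the occurrence-list mechanism LGM uses, only a direct check of the three subgraph conditions on small graphs. The dictionary layout (`labels`, `edges`) and the function names are assumptions made for this example; edges are stored as `(i, j)` with `i < j`.

```python
from itertools import permutations


def is_subgraph(pattern, host):
    """Brute-force check of the three conditions of Definition 3.

    Each graph is a dict with 'labels' (vertex -> label) and
    'edges' ((i, j) with i < j -> edge label). Exponential time:
    intended for tiny examples only.
    """
    p_vertices = sorted(pattern["labels"])
    h_vertices = sorted(host["labels"])
    for image in permutations(h_vertices, len(p_vertices)):
        m = dict(zip(p_vertices, image))  # injective by construction
        # Condition 1: vertex labels are identical.
        if any(pattern["labels"][v] != host["labels"][m[v]] for v in p_vertices):
            continue
        ok = True
        for (i, j), edge_label in pattern["edges"].items():
            # Condition 3: the order of edge endpoints is conserved.
            if not m[i] < m[j]:
                ok = False
                break
            # Condition 2: the mapped edge exists with the same label.
            if (m[i], m[j]) not in host["edges"]:
                ok = False
                break
            if host["edges"][(m[i], m[j])] != edge_label:
                ok = False
                break
        if ok:
            return True
    return False


def support(pattern, database):
    """Number of database graphs that contain the pattern."""
    return sum(is_subgraph(pattern, g) for g in database)


def is_frequent(pattern, database, sigma, s):
    """Definition 4: support at least sigma and at most s edges."""
    return support(pattern, database) >= sigma and len(pattern["edges"]) <= s
```

Because condition 3 is checked on edge endpoints, a pattern with an edge from a `b`-labeled vertex to a later `a`-labeled vertex does not match a host in which every such edge runs from `a` to `b`.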



3 Enumeration of Linear Subgraphs 

Before addressing the frequent pattern mining problem, let us design an algo- 
rithm for enumerating all subgraphs of a linear graph. For simplicity, we do not 
consider vertex and edge labels in this section, but inclusion of the labels is 
straightforward. 



3.1 Reduction Map 



Suppose we would like to enumerate all subgraphs of the linear graph shown at the
bottom of Figure 2, left. All linear subgraphs form a natural graph-shaped search
space, where one can traverse upwards or downwards by deleting or adding an
edge (Figure 2, left). For enumeration, however, one has to choose edges in the
search graph to form a search tree (Figure 2, right). Once a search tree is defined,
the enumeration can be done either by depth-first or breadth-first traversal. To
this end, we specify a reduction map $f : \mathcal{G} \to \mathcal{G}$ which transforms a child to its
parent uniquely. The mapping is chosen such that when it is applied repeatedly,
we eventually reduce it to an element of the solution set $S \subset \mathcal{G}$. Formally, we
write $\forall x \in \mathcal{G} : \exists k < \infty : f^k(x) \in S$. In our case, the reduction map is defined as
removing the "largest" edge from the child graph. The largest edge is defined via
the total order introduced in Definition 2. By evaluating the mapping repeatedly,
the graph is shrunk to the empty graph. Thus, here we have $S = \{\theta\}$.
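The reduction map can be sketched in a few lines for unlabeled graphs. The representation is an assumption of this example: a graph is a set of edges `(i, j)` with `i < j`, and vertices are implied by the edges, so an endpoint left isolated by a removal simply disappears. The function names are hypothetical.

```python
def largest_edge(edges):
    """Largest edge under Definition 2.

    Python compares tuples lexicographically: first the left node,
    then the right node, which is exactly the edge order.
    """
    return max(edges)


def reduce_once(edges):
    """Reduction map f: remove the largest edge.

    The empty graph is mapped to itself.
    """
    edges = frozenset(edges)
    if not edges:
        return edges
    return edges - {largest_edge(edges)}


def reduction_chain(edges):
    """Apply f repeatedly until the empty graph is reached."""
    chain = [frozenset(edges)]
    while chain[-1]:
        chain.append(reduce_once(chain[-1]))
    return chain
```

A graph with k edges reaches the empty graph after exactly k applications, which is the finiteness property the reverse search argument relies on.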

By applying $f(g)$ for all possible $g \in \mathcal{G}$, we can induce a search tree with $\theta \in \mathcal{G}$
being the root node, shown in Figure 2, right. A question is whether we can always
define a unique search tree for any linear graph. The reverse search theorem [2]
says that this is true iff any node in the graph-shaped search space
converges to the root node (i.e., the empty graph) by applying the map a finite
number of times. For our reduction map, this holds, because each possible linear
graph $g \in \mathcal{G}$ is reduced to the empty graph by successively applying $f$ to $g$.

A characteristic point of reverse search is that the search tree is implicitly 
defined by the reduction map. In actual traversal, the search tree is created on 
demand: when a traverser is at a node with graph g and would like to move 
down, a set of children nodes is generated by extending g. More precisely, one
enumerates all linear graphs by inverting the reduction mapping such that the
tree is explored from the root node towards the leaves.

The inverse mapping $f^{-1} : \mathcal{G} \to \mathcal{G}^*$ generates for a given linear graph $g \in \mathcal{G}$
a set of extended graphs $X = \{g' \mid f(g') = g\}$.
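For intuition, the inverse mapping can be sketched in a simplified setting: enumerating the edge subsets of one fixed, unlabeled host graph. There, the children of a subgraph are obtained by adding any host edge larger than its current largest edge, so removing the largest edge of a child always returns the parent. This is an illustrative simplification, not the pattern-level extension rule of LGM, and the function names are hypothetical.

```python
def children(current, host_edges):
    """Inverse of 'remove the largest edge' within one host graph.

    current: frozenset of edges (i, j) with i < j.
    Yields every edge set whose reduction is `current`.
    """
    threshold = max(current) if current else None
    for e in sorted(host_edges):
        if threshold is None or e > threshold:
            yield current | {e}


def enumerate_subgraphs(host_edges):
    """Depth-first traversal of the search tree rooted at the empty graph."""
    stack = [frozenset()]
    while stack:
        g = stack.pop()
        yield g
        stack.extend(children(g, host_edges))
```

Every edge subset is produced exactly once without any duplication check, because each subset has a single parent under the reduction map.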

There are three types of extension patterns according to the number of added 
nodes in the reduction mapping: (A) no-node-addition, (B) one-node-addition, 
(C) two-nodes-addition. Let us define the largest edge of g as (i, j), i < j. Then, 
the enumeration of case A is done by adding an edge which is larger than (i, j). 
For case B, a node is inserted to the position after i, and this node is connected to 
every other node. If the new edge is smaller than (i, j), this extension is canceled. 
For case C, two nodes are inserted to the position after i. In that case, the added 
two nodes must be connected by a new edge. All patterns of valid extensions are 
shown in Figure 3. This example does not include node labels, but for actual 
applications, node labels need to be enumerated as well. 

4 Frequent Pattern Mining 

In frequent pattern mining, we employ the same search tree described above, 
but the occurrences of a pattern in all linear graphs are tracked in an occurrence 



Algorithm 1 Linear Graph Miner (LGM)
Input:
    A set of linear graphs: $G = \{g_1, \ldots, g_{|G|}\}$
    Minimum support: $\sigma > 0$
    Maximum pattern size: $s > 0$
 1: function LGM($G, \sigma, s$)    ▷ the main function
 2:     Mine($G, \theta, \sigma, s$)
 3: end function
 4: function Mine($G, g, \sigma, s$)
 5:     sup ← support($L_G(g)$)
 6:     if sup < $\sigma$ then    ▷ check support condition
 7:         return
 8:     end if
 9:     Report occurrence of subgraph g
10:     if $|g| = s$ then    ▷ check pattern size
11:         return
12:     end if
13:     scan G once by using $L_G(g)$, find all extensions $f^{-1}(g)$
14:     for $g' \in f^{-1}(g)$ do
15:         Mine($G, g', \sigma, s$)    ▷ call Mine for every extended pattern g'
16:     end for
17: end function



list $L_G(g)$ [19], defined as follows:

$L_G(g) = \{(i, m) : g_i \in G, g \subseteq g_i \text{ with node correspondence } m\}$.

When a pattern g is extended, its occurrence list $L_G(g)$ is updated as well.
Based on the occurrence list, the support of each pattern g, i.e., the number of
linear graphs which contain the pattern, is calculated. Whenever the support is
smaller than the threshold $\sigma$, the search tree is pruned at this node. This pruning
is possible because of the anti-monotonicity of the support, namely the support
of a graph is never larger than that of its subgraph. Algorithm 1 describes the
recursive algorithm for frequent mining. In line 13, each pattern g is extended
to larger graphs $g' \in f^{-1}(g)$ by the inverse reduction mapping $f^{-1}$. The possible
extensions $f^{-1}(g)$ for each pattern g are found using the occurrence list $L_G(g)$.
The function Mine is recursively called for each extended pattern $g' \in f^{-1}(g)$
in line 15. The search tree is pruned in line 7 if the support of the pattern
g is smaller than the minimum support threshold $\sigma$, or in line 11 if the pattern
size $|g|$ is equal to the maximum pattern size s.

5 Complexity Analysis 

The computational time of frequent pattern mining depends on the minimum
support and maximum pattern size thresholds [19]. It also depends on the "density"
of the database: if all graphs are almost identical (i.e., a dense database),
the mining would take a prohibitive amount of time. Conventional worst-case
analysis is therefore not informative for mining algorithms. Instead, the delay,
i.e., the interval between two consecutive solutions, is often used to describe the
complexity. General graph mining algorithms including gSpan are exponential
delay algorithms, i.e., the delay is exponential in the size of patterns [19]. The
delay of our algorithm is only polynomial, because no duplication checks are
necessary thanks to the vertex order.

Fig. 4. Example of gap linear graphs: a 1-gap linear graph (left) and a 2-gap linear
graph (right). Edges corresponding to gaps are drawn in bold.

Theorem 1 (Polynomial delay). For N linear graphs G, a minimum support
$\sigma > 0$, and a maximum pattern size $s > 0$, the time between two successive calls
to Report in line 9 is bounded by a polynomial in the size of the input data.

Proof. Let $M := \max_i |V_{g_i}|$, $F := \max_i |E_{g_i}|$. The number of matching locations
in the linear graphs G can decrease when g is enlarged, because only the largest
edge is added. Considering the number of variations, it is easy to see that the
occurrence list always satisfies $|L_G(g)| \le M^2 N$. Therefore, the mapping $f^{-1}(g)$ can
be produced in $O(M^2 N)$ time, because the procedure scans the occurrence
list in line 13.

The time complexity between two successive calls to Report can now be
bounded by considering two cases after Report has been called once.

- Case 1. There is an extension g' fulfilling the minimum support condition, or
the size of g' is s. Then Report is called within $O(M^2 N)$ time.

- Case 2. There is no extension g' fulfilling the minimum support condition.
Then, no recursion happens and Mine returns in $O(M^2 N)$ time to its
parent node in the search tree. The maximum number of times this can happen
successively is bounded by the depth of the reverse search tree, which
is bounded by $O(F)$, because each level in the search tree adds one edge.
Therefore, in $O(M^2 N F)$ time the algorithm either calls Report again or
finishes.

Thus, the total time between two successive calls to Report is bounded by
$O(M^2 N F)$.
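The chain of bounds used in the proof can be summarized compactly. The names $T_{\text{extend}}$ and $T_{\text{delay}}$ are introduced here only for illustration and do not appear in the original argument.

```latex
\begin{align*}
|L_G(g)| &\le M^2 N
  && \text{(size of the occurrence list)} \\
T_{\text{extend}} &= O(M^2 N)
  && \text{(one scan of the occurrence list, line 13)} \\
T_{\text{delay}} &\le O(F) \cdot O(M^2 N) = O(M^2 N F)
  && \text{(at most } O(F) \text{ successive returns)}
\end{align*}
```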
Fig. 5. Execution time for the protein data. The line labeled gSpan+g1 is the
execution time of gSpan on the 1-gap linear graph dataset. gSpan does not work on
the 2-gap linear graph dataset even if the minimum support threshold

6 Experiments 

We performed a motif extraction experiment from protein 3D structures. Fre- 
quent and characteristic patterns are often called "motifs" in molecular biology, 
and we adopt that terminology here. All experiments were performed on a Linux 
machine with an AMD Opteron processor (2 GHz and 4GB RAM). 

6.1 Motif extraction from protein 3D structures 

We adopted the dataset of Glyankina et al. [7], which consists of pairs of homologous
proteins: one is derived from a thermophilic organism and the other is 
from a mesophilic organism. This dataset was made for understanding struc- 
tural properties of proteins which are responsible for the higher thermostability 
of proteins from thermophilic organisms compared to those from mesophilic or- 
ganisms. In constructing a linear graph from a 3D structure, each amino acid is 
represented as a vertex. Vertex labels are chosen from {1, ..., 6}, which represent
the following six classes: aliphatic {AVLIMC}, aromatic {FWYH}, polar
{STNQ}, positive {KR}, negative {DE}, and special (reflecting their special
conformation properties) {GP} [10]. An edge is drawn between each pair of amino
acid residues whose distance is within 5 angstroms. No edge labels are assigned.
In total, 754 graphs were made. The average numbers of vertices and edges are 371
and 498, respectively, and the number of labels is 6. To detect the motifs
characterizing the difference between the two organisms, we take the following two-step 
approach. First, we employ LGM to find frequent patterns from all proteins of 
both organisms. In this setting, we did not use (c-6) patterns in Figure 3. Fi- 
nally, the patterns significantly associated with organism difference are selected 
via statistical tests. 
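The graph construction described above can be sketched as follows. The text does not say which atoms define the distance between two residues, so this example assumes one representative coordinate per residue (for instance the C-alpha atom); the function name `contact_map` and its parameters are likewise assumptions.

```python
import math

# Six residue classes from the text, numbered 1..6 in the order listed.
CLASS_OF = {}
for label, residues in enumerate(
    ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"], start=1
):
    for residue in residues:
        CLASS_OF[residue] = label


def contact_map(sequence, coords, cutoff=5.0):
    """Build an unlabeled-edge linear graph from a protein chain.

    sequence: one-letter amino acid string, N- to C-terminus.
    coords:   one (x, y, z) tuple per residue, in angstroms
              (assumed representative atom, not given in the text).
    Returns (labels, edges): vertex -> class label, and a set of
    edges (i, j) with i < j for residue pairs within `cutoff`.
    """
    labels = {i: CLASS_OF[aa] for i, aa in enumerate(sequence, start=1)}
    edges = set()
    n = len(sequence)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if math.dist(coords[i - 1], coords[j - 1]) <= cutoff:
                edges.add((i, j))
    return labels, edges
```

Vertices are numbered by sequence position, so the total order required by Definition 1 comes for free from the chain direction.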

We assess the execution time of our algorithm in comparison with gSpan.
The linear graphs from protein 3D structures are not always connected graphs,
and gSpan cannot be applied to such disconnected graphs. Hence, we made
two kinds of gap linear graphs: 1-gap linear graphs and 2-gap linear graphs. A 1-gap
linear graph is a linear graph in which contiguous vertices in the protein sequence
are connected by an edge; a 2-gap linear graph is a 1-gap linear graph in which any
two vertices separated by one vertex in the protein sequence are also connected by
an edge (Figure 4).
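The 1-gap and 2-gap constructions amount to adding backbone edges between nearby sequence positions. A minimal sketch, assuming vertices are numbered 1..n along the sequence and edges are unlabeled pairs `(i, j)` with `i < j`; the function name `add_gap_edges` is hypothetical, and whether gap edges carry a distinguishing label is not stated in the text.

```python
def add_gap_edges(n_vertices, edges, gap):
    """Return the edge set of the `gap`-gap linear graph.

    gap=1 connects contiguous vertices (i, i+1);
    gap=2 additionally connects (i, i+2).
    """
    out = set(edges)
    for d in range(1, gap + 1):
        for i in range(1, n_vertices - d + 1):
            out.add((i, i + d))
    return out
```

With these edges added, every graph is connected, and a pattern that skips at most `gap - 1` sequence positions becomes reachable as a connected pattern.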
Fig. 6. Significant subgraphs detected by LGM. The p-value calculated by Fisher's exact
test is attached to each linear graph. The node labels 1, 2, 3, 4 and 5 represent aliphatic,
aromatic, polar, positive and negative amino acids, respectively.

Fig. 7. 3D structures of TATA-binding protein (left) and human pol II promoter protein
(right). The spheres represent the amino acid residues corresponding to the vertices
forming the subgraphs in Figure 6.



We ran gSpan on two datasets: one consists of 1-gap linear graphs and the other
consists of 2-gap linear graphs. We ran LGM on the original linear graphs. We
set the maximum execution time to 12 hours for both programs. Figure 5 shows
the execution time for varying minimum support thresholds. gSpan does not
work on the 2-gap linear graph dataset even if the minimum support threshold
is 50. Our algorithm is faster than gSpan on the 1-gap linear graph dataset, and
its execution time is reasonable.

Next, we assess the motif extraction ability of our algorithm. To choose significant
subgraphs from the enumerated subgraphs, we use Fisher's exact test. In
this case, a significant subgraph should distinguish thermophilic proteins from
mesophilic proteins. Thus, for each frequent subgraph, we count the number of
proteins containing this subgraph among the thermophilic and mesophilic proteins
and generate a 2 x 2 contingency table, which includes the number of thermophilic
organisms that contain subgraph g, $n_{TP}$, the number of thermophilic organisms
that do not contain subgraph g, $n_{FP}$, the number of mesophilic organisms
that do not contain subgraph g, $n_{FN}$, and the number of mesophilic organisms
that contain subgraph g, $n_{TN}$. The probability representing the independence
in the contingency table is calculated as follows:

$$\Pr = \frac{n_P! \, n_N! \, n_g! \, n_{g'}!}{n! \, n_{TP}! \, n_{FP}! \, n_{FN}! \, n_{TN}!},$$

where $n_P$ is the number of thermophilic proteins; $n_N$ the number of mesophilic
proteins; $n_g$ the number of proteins with subgraph g; $n_{g'}$ the number of proteins
without subgraph g; and $n = n_P + n_N$ the total number of proteins. The p-value of
the two-sided Fisher's exact test on a table can be computed as the sum of the
probabilities of all tables that are at least as extreme as this table.
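The test can be sketched with the standard library alone. This is an illustration of the two-sided Fisher's exact test in its usual form (summing all tables with the same margins whose probability does not exceed that of the observed table), not the authors' implementation; the function names and the small floating-point tolerance are assumptions.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of the table [[a, b], [c, d]].

    Rows: thermophilic, mesophilic. Columns: contains g, does not.
    Equivalent to the factorial form with fixed row and column sums.
    """
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def fisher_two_sided(a, b, c, d):
    """Two-sided p-value of Fisher's exact test for a 2 x 2 table."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    p_value = 0.0
    # Enumerate every table with the same margins via its top-left cell.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        # Tolerance guards against rounding when probabilities tie.
        if p <= p_observed * (1 + 1e-9):
            p_value += p
    return min(p_value, 1.0)
```

A subgraph would then be kept when `fisher_two_sided` on its contingency table falls below the chosen significance threshold.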

We ranked the frequent subgraphs according to their p-values, and obtained
103 subgraphs whose p-values are no more than 0.001. Here, we focus on a
pair of proteins, TATA-binding protein and human pol II promoter protein, where
TATA-binding protein is derived from a thermophilic organism and human pol II
promoter protein is from a mesophilic organism. We chose these two proteins
because they include a large number of statistically significant motifs which are
mutually exclusive between the two organisms. These two proteins share the same
function as DNA-binding proteins, but their thermostabilities are different.
Figure 6 shows the top-3 subgraphs in significance. Figure 7 shows the 3D structures
of TATA-binding protein (left) and human pol II promoter protein (right);
the amino acid residues forming the top-3 subgraphs are represented by spheres.

7 Conclusion 

We proposed an efficient algorithm for mining frequent subgraphs from linear graphs.
A key point is that vertices in a linear graph are totally ordered. We designed a
fast enumeration algorithm for linear graphs based on this property. For efficient
enumeration without duplication, we define a search tree based on the reverse
search technique. Unlike gSpan, our algorithm enumerates frequent subgraphs
including disconnected ones by traversing this search tree. Many kinds
of data that can be represented as linear graphs, such as protein 3D structures
and alternative splicing forms, include disconnected subgraphs as important
patterns. Our algorithm has polynomial delay.

We performed a motif extraction experiment on a protein 3D-structure dataset
from molecular biology. In the experiment, our algorithm could extract important
subgraphs as frequent patterns. By comparing our algorithm to gSpan with
respect to execution time, we have shown that our algorithm is fast enough for
real-world datasets.

Data which can be represented as linear graphs occur in many fields, for
instance bioinformatics and natural language processing. Our algorithm for mining
linear graphs provides a new way to analyze such data.